Cleaning up our products dataset is the next step in our analysis. This includes determining the degree of missing data, removing unnecessary columns, looking for outliers, and, if necessary, reformatting.

This information would likely be more useful in a percentage format, allowing us to quickly determine whether columns with a significant number of missing rows are required for our analysis.
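As a quick illustration of what such a percentage summary can look like (the toy dataframe and its column names below are placeholders, not the full dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Products dataframe (column names are assumptions).
df = pd.DataFrame({
    "product_name": ["A", "B", "C", "D"],
    "variation_desc": [np.nan, np.nan, np.nan, "shade"],
    "price_usd": [10.0, np.nan, 30.0, 25.0],
})

# Share of missing rows per column, expressed as a percentage.
missing_pct = (df.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```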

Here we see a few columns with a significant percentage of missing values. Using a heatmap, we can see exactly where the missing rows fall.

This allows us to identify the columns that have a significant amount of missing data so that we may decide which columns to delete from our dataset.

Finding and dealing with outliers in our dataset is the next stage of our cleanup. Potential outliers must be dealt with right away because they can distort later calculations and representations of the data as a whole. To look for outliers in our numerical columns, we will use kurtosis, a measure of the tailedness of a distribution, i.e., how heavily data points fall in the tails relative to the center. Outliers within the set are more likely to occur when kurtosis values are higher.

The price_usd column in the Products dataset, which represents the pricing in U.S. dollars, has a noticeably larger kurtosis value than the other numerical columns. Python's describe() method can be used to search for an outlier on either the left or right side of the distribution.
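A minimal sketch of this kurtosis-plus-describe() check, using synthetic prices rather than the real column:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Mostly moderate prices plus a few extreme ones, mimicking a heavy right tail.
prices = np.concatenate([rng.normal(30, 10, 500), [500.0, 800.0, 1900.0]])
df = pd.DataFrame({"price_usd": prices})

# High kurtosis hints at outliers; describe() shows which side of the
# distribution they sit on (here the max far exceeds the 75th percentile).
print(df["price_usd"].kurtosis())
print(df["price_usd"].describe())
```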

We also observe that there is a substantial possibility of outliers in the columns associated with the feedback counts. Since they don't relate to our analysis tasks, these columns won't be used in this project.

Identifying which columns in the dataframe are required for our analysis is the next step in cleaning. Excluding a column may be justified by 1) repetition (the column is mostly redundant), 2) relevance (how well it applies to our analysis), or 3) completion (too many NaNs and nulls to be useful). By printing the columns where more than 50% of rows share the same value, we can determine whether any columns are redundant. To see which columns offer information, we can also print the most frequent values for each column.
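One way to sketch the redundancy check (the 50% threshold matches the text; the toy columns are assumptions):

```python
import pandas as pd

# Toy dataframe: one near-constant column and one informative column.
df = pd.DataFrame({
    "limited_edition": [False] * 9 + [True],
    "price_usd": list(range(10)),
})

# Report columns whose most frequent value covers more than 50% of rows.
for col in df.columns:
    top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
    if top_share > 0.5:
        print(col, df[col].mode()[0], round(top_share, 2))
```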

We can now see the columns from the Products dataset in which over 50% of the values are identical.
- The variation_desc, value_price_usd, and sale_price_usd columns are mostly NaN, so these three will be removed from the dataframe.
- The limited_edition, new, online_only, out_of_stock, and sephora_exclusive columns all have boolean data types (True/False), so they will not be removed for redundancy; however, they offer no insight for our analysis, so they will be removed for relevance.
- The child_max_price and child_min_price columns are mostly NaN and not important to our research, and the child_count column is redundant, so all three will be removed from the dataframe.
- The is_recommended boolean column is irrelevant, and the helpfulness column is mostly NaN.
- The user-feature columns may be useful in later analyses, but they will not help with our sentiment and text analysis.

With 20 of the original 40 columns remaining in the updated Products dataframe, df1, we now have far fewer missing values to work with and a somewhat smaller dataset, which makes computation and processing simpler.

We will look at product categories and ingredient trends in this section.

The results show that the data on products was broken down into nine "primary" categories, from which "secondary" and "tertiary" categories were used to further filter the data. A large percentage of the data is composed of products that fall under the skincare category. This may be useful to us when we do the sentiment and text analyses of the skincare reviews dataset.

The results for the secondary categories reveal a longer list than the primary categories, with 41 categories as opposed to our original 9.

As expected, the number of distinct groups grows dramatically as we narrow down, with 118 tertiary categories.

Cosine Similarity Analysis is a technique we can use to identify the products that are most similar to one another. By calculating the cosine of the angle between two vectors in a matrix, we may use this method to determine the similarity of two texts regardless of their size differences. The text strings are first transformed into word vectors in a matrix. Next, we calculate the angle between the matrix's vectors and provide a score between 0 and 1, with values closer to 0 indicating less similarity and values closer to 1 indicating greater similarity. To start, we'll make a new dataframe with just the columns we need.

Products without listed ingredients are then removed, and the final dataframe's index is reset. Resetting the index is necessary so that we can later match each product's index to that of its most similar counterpart.

For this work, we'll use the TfidfVectorizer from Python's sklearn module. A numerical metric called TF-IDF (term frequency - inverse document frequency) shows how important a particular word is to a document. The ingredient lists can be turned into vectors using TfidfVectorizer.

This generates a 7549x7549 sparse matrix largely made up of zeros.

To work with the matrix using the numpy module, we will convert it to an array. Since the 1 values in the array represent each product's similarity to itself, we must first mask them as NaN values. By then calculating the argmax of each row, we can determine the index of the most similar ingredient list. The list of indices this returns lets us loop through the dataframe containing all of the current products.

After getting the argmax, we now have an array of the product indices that are most comparable to the row they are located in. We can see that the generated array has a size of 7549 rows by 1 column. To use the array with the Pandas library, we will turn it into a dataframe object.

The values from the product_name column will be taken in the following step and assigned to the "products" variable. The "most_sim_index" column of indices will then be taken and assigned to a variable. Using our two new variables to loop through each row, we can then create a new column called "most_sim_product" that contains the names of the items at each of the indices.
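The steps above (vectorizing the ingredient lists, masking self-similarity, taking the argmax of each row, and mapping the indices back to product names) can be sketched on a toy example; the `ingredients` column name and the three products below are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df = pd.DataFrame({
    "product_name": ["Serum A", "Serum B", "Balm C"],
    "ingredients": [
        "water glycerin niacinamide",
        "water glycerin niacinamide panthenol",
        "beeswax shea butter",
    ],
})

# Turn each ingredient list into a TF-IDF word vector.
tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(df["ingredients"])

# Pairwise cosine similarities (dense n x n array).
sim = cosine_similarity(matrix)

# Mask each product's similarity to itself, then take the row-wise argmax.
np.fill_diagonal(sim, np.nan)
most_sim_index = np.nanargmax(sim, axis=1)

# Map the indices back to product names.
df["most_sim_product"] = df["product_name"].iloc[most_sim_index].values
print(df[["product_name", "most_sim_product"]])
```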

Some of the results are returning items that are simply the travel size variant of the same product, which is not really helpful to us. We will need to change our initial dataframe in order to remove these from the results.

Finally, a new dataframe gives us the most comparable products based on the cosine similarity of their ingredients. Building on this, we can now retrieve the names and prices of the most similar items using the same method we used to iterate over the indices, as this information may be useful to us.

We would be able to examine the sentiment and sales performance of each relevant product using this dataframe together with a sentiment and sales dataframe. Then, we could assess how well products performed in relation to their acquisition costs and profitability to decide whether or not their most similar counterpart should be offered to customers instead.

We are interested in the products and brands that are regarded as the best or worst, as well as if the emotional tone of the reviews is positive or negative. We will use sentiment analysis to collect the information we need for this task.

We already have the columns we need: rating (a number from 1 to 5, with 5 being the best) and review_text (the complete text of a review). To determine what to anticipate, we will look at the overall score distribution.

This shows us that the majority of reviews are also likely to be positive because the majority of ratings are positive. A wordcloud, which is a depiction of the most frequently occurring terms across several texts, might help us understand what is being said the most. Larger words indicate more frequent usage.

Our dataframe has a large number of rows, so applying the preprocess_text function to each row individually can be time-consuming, and processing a large amount of text data is computationally intensive. Therefore, instead of applying the function row by row, we used vectorized operations provided by pandas to process the entire column at once, which can significantly speed up the computation.
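A sketch of the vectorized approach, using pandas' `.str` accessor in place of a row-by-row `preprocess_text` call (the cleaning steps shown are a simplified stand-in for the real function):

```python
import pandas as pd

df = pd.DataFrame({"review_text": ["Great SERUM!!", "Too sticky :(", None]})

# Vectorized string cleaning: lowercase, strip non-letters, trim whitespace.
# Each .str call operates on the whole column at once.
clean = (
    df["review_text"]
    .fillna("")
    .str.lower()
    .str.replace(r"[^a-z\s]", "", regex=True)
    .str.strip()
)
print(clean.tolist())
```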

The overall wordcloud can offer some insight, but it would be even more instructive to observe what is mainly being said in reviews that we categorize as positive and those that we categorize as negative. To do this, we must first organize the reviews based on their scores. We will exclude reviews that contain a score of 3, which is regarded as the midpoint between the two spectrums of our rating system. Then, we will give ratings between 1-2 a negative value and ratings between 4-5 a positive value.
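The labeling rule above (drop 3-star reviews, then split 1-2 vs. 4-5) can be sketched as follows; `sentiment` is the new column we derive:

```python
import pandas as pd

df = pd.DataFrame({"rating": [1, 2, 3, 4, 5]})

# Exclude the neutral midpoint of the 1-5 scale.
df = df[df["rating"] != 3].copy()

# Ratings 4-5 become positive, ratings 1-2 become negative.
df["sentiment"] = (df["rating"] >= 4).map({True: "positive", False: "negative"})
print(df)
```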

The positive wordcloud, after some editing, enables us to understand what is being expressed in positive reviews: Words like "acne prone", "dry skin", "oily skin", "soft skin", and "sensitive skin" appear in the wordcloud, indicating that customers are considerably more likely to discuss their skin condition in relation to the product. Words like "routine", "favorite", "obsessed", "staple", "game changer", and most memorably, "Holy Grail" are signs of newly returning customers as a result of their encounter with a product. Customers frequently cite the entire product category when reviewing a specific product (e.g., "serum," "sunscreen," "toner," "lip balm," "eye cream"), drawing similarities to previous purchases.

Words like "money," "overpriced," "worth," "waste," "buy," and "price" within the wordcloud are immediately noticeable since they relate to a product's effectiveness in relation to its price. When a product doesn't live up to expectations, customers are more angry, and this is especially true for bigger investments. When a review is critical of a product, customers frequently include their skin type (e.g., "sensitive skin," "acne prone," "oily skin," etc.). Negative reviews also tend to focus more on the product's shortcomings, with words like "fragrance," "formula," "packaging," "sticky," "texture," and "scent" appearing in the wordcloud. The verbs in the wordcloud, such as "sting," "drying," "break outs," "burn," and "irritating," highlight some of the most typical adverse skin reactions to products. Visualizing the sentiment we created through the ratings column allows us to corroborate our initial assessment of the dataset.

The majority of the reviews are positive, as expected.

To create a classification model that forecasts whether reviews are positive or negative, we can use logistic regression.

We'll make a new dataframe with just the review title and sentiment columns.

Since the data is very large, with over 1.3 million records, the run time for model fitting is drastically long. To improve it, we will try using stratified sampling to create a smaller representative subset of our data instead of using the entire dataset.
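One way to take such a stratified subset is scikit-learn's train_test_split with the stratify argument (the 10% fraction, toy sizes, and column names here are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced data: 80 positive vs. 20 negative reviews.
df = pd.DataFrame({
    "review_title": ["t"] * 100,
    "sentiment": ["positive"] * 80 + ["negative"] * 20,
})

# Keep 10% of the rows while preserving the class proportions.
subset, _ = train_test_split(
    df, train_size=0.1, stratify=df["sentiment"], random_state=42
)
print(subset["sentiment"].value_counts())
```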

It appears that the target variable is imbalanced. When the total number of one class of data is significantly higher than the total number of another class, this situation is known as class imbalance in machine learning. Machine learning models generally overclassify the larger class when there is a class imbalance in the training data because of its increased prior probability. Logistic regression, which we will be utilizing for our model, is estimated by maximizing a log-likelihood objective function formulated under the assumption of maximizing overall accuracy. That assumption does not hold for imbalanced data. The resulting models tend to be biased towards the majority class, which can result in significant loss in practice.

We will downsample the majority class so that both classes are equal in size.
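A sketch of the downsampling step with scikit-learn's resample (the class labels and sizes are illustrative):

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"sentiment": ["positive"] * 80 + ["negative"] * 20})

majority = df[df["sentiment"] == "positive"]
minority = df[df["sentiment"] == "negative"]

# Sample the majority class down to the minority class size, without
# replacement, so the two classes end up equal.
majority_down = resample(
    majority, replace=False, n_samples=len(minority), random_state=42
)
balanced = pd.concat([majority_down, minority])
print(balanced["sentiment"].value_counts())
```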

We were able to address the issue of imbalanced data and improve the distribution.

The dataframe will now be divided into train and test sets. 20% of the data will be used for testing, with the remaining 80% being used for training.

The text will now be converted into a bag of words (BoW) model, which is effectively a matrix of how frequently each word appears. Because logistic regression cannot operate on raw text, we must convert it to a BoW model.
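The split, BoW conversion, and model fit can be sketched end to end on toy reviews (CountVectorizer builds the word-count matrix; the texts and labels below are made up, not our data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

texts = ["love this serum", "great product", "terrible waste", "broke me out",
         "holy grail", "irritating and sticky", "amazing results", "smells awful"]
labels = [1, 1, 0, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

# 80/20-style split (here 6 train / 2 test), stratified on the labels.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# Bag of words: fit the vocabulary on the training texts only.
vec = CountVectorizer()
X_train_bow = vec.fit_transform(X_train)
X_test_bow = vec.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_bow, y_train)
score = model.score(X_test_bow, y_test)
print(score)
```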

By visualizing the confusion matrix and inspecting the classification report, we can assess the accuracy and performance of our logistic regression model for sentiment analysis.

The heatmap of the confusion matrix shows the predicted labels on the x-axis and the true labels on the y-axis. The numbers in the cells indicate the counts of samples for each combination of predicted and true labels.

When we look at the confusion matrix, we can see that 3,213 true positives (positive reviews correctly predicted as positive) were recorded, along with 1,564 false positives (negative reviews incorrectly predicted as positive). There were 319 false negatives (positive reviews incorrectly predicted as negative) and 4,508 true negatives (negative reviews correctly predicted as negative).

The classification report will provide metrics such as precision, recall, F1-score, and support for each class (positive and negative sentiment). It will give us a detailed overview of the model's performance.
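For illustration, scikit-learn produces both the confusion matrix and the classification report directly from true and predicted labels (the toy labels below are not our model's output):

```python
from sklearn.metrics import classification_report, confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Precision, recall, F1-score, and support per class.
print(classification_report(y_true, y_pred))
```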

According to the classification report, our model produced an overall accuracy of 80% without any feature extraction or significant preprocessing. This model could be improved and applied to incoming data for classification. For instance, Sephora could provide a coupon code for a different brand to customers whose product reviews the model predicted would be negative, and a coupon code for a comparable product to customers whose reviews the model predicted would be positive.

We can plot the ROC curve to visualize the trade-off between the true positive rate (sensitivity) and false positive rate (1-specificity) at different classification thresholds. This curve can help assess the model's ability to distinguish between positive and negative classes. The ROC curve should ideally be closer to the top-left corner, indicating higher sensitivity and specificity.

The precision-recall curve shows the trade-off between precision and recall at different classification thresholds and is especially useful when dealing with imbalanced datasets. Higher precision and recall values are better; a curve that stays closer to the top-right corner indicates better model performance.

This chart shows us that Logistic Regression has the highest accuracy compared to SVM and Random Forest models, while the Random Forest model has the lowest performance.

By analyzing ROC and Precision-Recall curves, we can also assess how well these classifiers perform in terms of their ability to correctly classify positive instances and avoid misclassifying negative instances. Generally, a higher AUC and a curve closer to the top-left or top-right corner suggest better classifier performance.

ROC Curve: TPR (True Positive Rate) represents the proportion of positive instances correctly classified as positive, and FPR (False Positive Rate) represents the proportion of negative instances incorrectly classified as positive. The curve shows the TPR-FPR trade-off at different classification thresholds; a good classifier has higher TPR and lower FPR, closer to the top-left corner. AUC (Area Under the Curve) measures overall performance: higher AUC indicates better classification, with 0.5 being random and 1.0 being perfect.

Precision-Recall Curve: Precision is the proportion of true positive instances among those predicted as positive, and recall (sensitivity) is the proportion of positive instances correctly classified as positive. The curve shows the trade-off between precision and recall at different classification thresholds; a good classifier has higher precision and recall, closer to the top-right corner. Here too, a higher AUC indicates better classification.

We can also see here that when compared to the SVM and Random Forest models, Logistic Regression has the highest accuracy. The performances of SVM and Logistic Regression are also very close to each other, whereas the Random Forest model is the least accurate of the three.

When we look at the printed example predictions, we can see that Logistic Regression and SVM predicted the sentiments accurately, whereas the Random Forest model was less accurate, with one false positive prediction.

For visualization, we will obtain the top 10 brands and items in terms of loves_count (the total number of times a product has been designated as a favorite) and rating (the overall average rating of the product).
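A sketch of the top-N aggregation by loves_count (the toy brand totals below are illustrative; only LANEIGE's 1.97 million figure comes from our data):

```python
import pandas as pd

df = pd.DataFrame({
    "brand_name": ["LANEIGE", "LANEIGE", "Tatcha", "Tatcha", "Dior"],
    "loves_count": [1_500_000, 470_000, 300_000, 250_000, 400_000],
})

# Sum loves_count per brand, then keep the N largest totals.
top_brands = df.groupby("brand_name")["loves_count"].sum().nlargest(3)
print(top_brands)
```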

There are 142 different brands and 2,333 different products in the dataset.

From this, we can say that the LANEIGE brand, which has 1.97 million loves_count across the site, is leading among the others.

Out of the roughly 300 brands on the site, Erno Laszlo is the least "loved" brand on Sephora.

We can also see how few loves the bottom 5 performers have accumulated. However, the ratings for each brand are not very useful because the average rating for most brands is between 4 and 5. As with the top/bottom items, there is little noticeable difference between the performers at either end of the 1-5 scale.

The totals of our derived "sentiment" column can be visualized to show the difference between positive and negative reviews for each brand. By this measure, CLINIQUE leads the way.

Computing this difference also lets us identify brands whose totals go negative, as is shown with TWEEZERMAN, the lowest-performing brand across the entire website.

Lip Sleeping Mask Intense Hydration with Vitamin C, our best product in terms of sentiment, performs remarkably better than the other top 4, with about 11,500 positive difference compared to the roughly 6,000 positive difference attained by the others.

The products that performed the poorest on the website all managed to score lower than zero, with Clean Cleansing & Gentle Exfoliating Wipes having the worst performance.